{
 "cells": [
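  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Pima Indians Diabetes Data Set: Data Exploration\n",
    "\n",
    "About the data:\n",
    "The Pima Indians Diabetes Data Set is used to predict, from existing medical information, the probability that a Pima Indian patient develops diabetes within 5 years.\n",
    "\n",
    "The data set has 9 columns:\n",
    "\n",
    "- Column 0: number of pregnancies\n",
    "- Column 1: plasma glucose concentration at 2 hours in an oral glucose tolerance test\n",
    "- Column 2: diastolic blood pressure (unit: mm Hg)\n",
    "- Column 3: triceps skin fold thickness (unit: mm)\n",
    "- Column 4: 2-hour serum insulin (unit: μU/ml)\n",
    "- Column 5: body mass index (weight in kg / (height in m)^2)\n",
    "- Column 6: diabetes pedigree function\n",
    "- Column 7: age\n",
    "- Column 8: class variable (0 or 1)\n",
    "\n",
    "Data source: https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes"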
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Import the required packages for reading files and encoding features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Path and file name of the data file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pregnants</th>\n",
       "      <th>Plasma_glucose_concentration</th>\n",
       "      <th>blood_pressure</th>\n",
       "      <th>Triceps_skin_fold_thickness</th>\n",
       "      <th>serum_insulin</th>\n",
       "      <th>BMI</th>\n",
       "      <th>Diabetes_pedigree_function</th>\n",
       "      <th>Age</th>\n",
       "      <th>Target</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>6</td>\n",
       "      <td>148</td>\n",
       "      <td>72</td>\n",
       "      <td>35</td>\n",
       "      <td>0</td>\n",
       "      <td>33.6</td>\n",
       "      <td>0.627</td>\n",
       "      <td>50</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>85</td>\n",
       "      <td>66</td>\n",
       "      <td>29</td>\n",
       "      <td>0</td>\n",
       "      <td>26.6</td>\n",
       "      <td>0.351</td>\n",
       "      <td>31</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>8</td>\n",
       "      <td>183</td>\n",
       "      <td>64</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>23.3</td>\n",
       "      <td>0.672</td>\n",
       "      <td>32</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>89</td>\n",
       "      <td>66</td>\n",
       "      <td>23</td>\n",
       "      <td>94</td>\n",
       "      <td>28.1</td>\n",
       "      <td>0.167</td>\n",
       "      <td>21</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>137</td>\n",
       "      <td>40</td>\n",
       "      <td>35</td>\n",
       "      <td>168</td>\n",
       "      <td>43.1</td>\n",
       "      <td>2.288</td>\n",
       "      <td>33</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   pregnants  Plasma_glucose_concentration  blood_pressure  \\\n",
       "0          6                           148              72   \n",
       "1          1                            85              66   \n",
       "2          8                           183              64   \n",
       "3          1                            89              66   \n",
       "4          0                           137              40   \n",
       "\n",
       "   Triceps_skin_fold_thickness  serum_insulin   BMI  \\\n",
       "0                           35              0  33.6   \n",
       "1                           29              0  26.6   \n",
       "2                            0              0  23.3   \n",
       "3                           23             94  28.1   \n",
       "4                           35            168  43.1   \n",
       "\n",
       "   Diabetes_pedigree_function  Age  Target  \n",
       "0                       0.627   50       1  \n",
       "1                       0.351   31       0  \n",
       "2                       0.672   32       1  \n",
       "3                       0.167   21       0  \n",
       "4                       2.288   33       1  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#input data\n",
    "train = pd.read_csv(\"C:/Users/14916/Desktop/s/pima-indians-diabetes.csv\")\n",
    "train.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pregnants</th>\n",
       "      <th>Plasma_glucose_concentration</th>\n",
       "      <th>blood_pressure</th>\n",
       "      <th>Triceps_skin_fold_thickness</th>\n",
       "      <th>serum_insulin</th>\n",
       "      <th>BMI</th>\n",
       "      <th>Diabetes_pedigree_function</th>\n",
       "      <th>Age</th>\n",
       "      <th>Target</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>768.000000</td>\n",
       "      <td>768.000000</td>\n",
       "      <td>768.000000</td>\n",
       "      <td>768.000000</td>\n",
       "      <td>768.000000</td>\n",
       "      <td>768.000000</td>\n",
       "      <td>768.000000</td>\n",
       "      <td>768.000000</td>\n",
       "      <td>768.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>3.845052</td>\n",
       "      <td>120.894531</td>\n",
       "      <td>69.105469</td>\n",
       "      <td>20.536458</td>\n",
       "      <td>79.799479</td>\n",
       "      <td>31.992578</td>\n",
       "      <td>0.471876</td>\n",
       "      <td>33.240885</td>\n",
       "      <td>0.348958</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>3.369578</td>\n",
       "      <td>31.972618</td>\n",
       "      <td>19.355807</td>\n",
       "      <td>15.952218</td>\n",
       "      <td>115.244002</td>\n",
       "      <td>7.884160</td>\n",
       "      <td>0.331329</td>\n",
       "      <td>11.760232</td>\n",
       "      <td>0.476951</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.078000</td>\n",
       "      <td>21.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>99.000000</td>\n",
       "      <td>62.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>27.300000</td>\n",
       "      <td>0.243750</td>\n",
       "      <td>24.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>3.000000</td>\n",
       "      <td>117.000000</td>\n",
       "      <td>72.000000</td>\n",
       "      <td>23.000000</td>\n",
       "      <td>30.500000</td>\n",
       "      <td>32.000000</td>\n",
       "      <td>0.372500</td>\n",
       "      <td>29.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>6.000000</td>\n",
       "      <td>140.250000</td>\n",
       "      <td>80.000000</td>\n",
       "      <td>32.000000</td>\n",
       "      <td>127.250000</td>\n",
       "      <td>36.600000</td>\n",
       "      <td>0.626250</td>\n",
       "      <td>41.000000</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>17.000000</td>\n",
       "      <td>199.000000</td>\n",
       "      <td>122.000000</td>\n",
       "      <td>99.000000</td>\n",
       "      <td>846.000000</td>\n",
       "      <td>67.100000</td>\n",
       "      <td>2.420000</td>\n",
       "      <td>81.000000</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        pregnants  Plasma_glucose_concentration  blood_pressure  \\\n",
       "count  768.000000                    768.000000      768.000000   \n",
       "mean     3.845052                    120.894531       69.105469   \n",
       "std      3.369578                     31.972618       19.355807   \n",
       "min      0.000000                      0.000000        0.000000   \n",
       "25%      1.000000                     99.000000       62.000000   \n",
       "50%      3.000000                    117.000000       72.000000   \n",
       "75%      6.000000                    140.250000       80.000000   \n",
       "max     17.000000                    199.000000      122.000000   \n",
       "\n",
       "       Triceps_skin_fold_thickness  serum_insulin         BMI  \\\n",
       "count                   768.000000     768.000000  768.000000   \n",
       "mean                     20.536458      79.799479   31.992578   \n",
       "std                      15.952218     115.244002    7.884160   \n",
       "min                       0.000000       0.000000    0.000000   \n",
       "25%                       0.000000       0.000000   27.300000   \n",
       "50%                      23.000000      30.500000   32.000000   \n",
       "75%                      32.000000     127.250000   36.600000   \n",
       "max                      99.000000     846.000000   67.100000   \n",
       "\n",
       "       Diabetes_pedigree_function         Age      Target  \n",
       "count                  768.000000  768.000000  768.000000  \n",
       "mean                     0.471876   33.240885    0.348958  \n",
       "std                      0.331329   11.760232    0.476951  \n",
       "min                      0.078000   21.000000    0.000000  \n",
       "25%                      0.243750   24.000000    0.000000  \n",
       "50%                      0.372500   29.000000    0.000000  \n",
       "75%                      0.626250   41.000000    1.000000  \n",
       "max                      2.420000   81.000000    1.000000  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
    ],
   "source": [
    "# Summary statistics of the numeric features\n",
    "train.describe()"
   ]
  },
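  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The summary shows that many columns have a minimum value of 0. For some of these variables a value of 0 is not physiologically meaningful, which indicates that the value is invalid or missing.\n",
    "\n",
    "Specifically, a value of 0 is meaningless for the following variables:\n",
    "\n",
    "1. Plasma glucose concentration\n",
    "2. Diastolic blood pressure\n",
    "3. Triceps skin fold thickness\n",
    "4. 2-hour serum insulin\n",
    "5. Body mass index\n",
    "\n",
    "In a pandas DataFrame, `replace()` makes it easy to mark these values as NaN in the subset of columns we care about.\n",
    "\n",
    "Once the missing values are marked, `isnull()` flags every NaN as True, and summing the result gives the number of missing values in each column."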
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "pregnants                         0\n",
      "Plasma_glucose_concentration      5\n",
      "blood_pressure                   35\n",
      "Triceps_skin_fold_thickness     227\n",
      "serum_insulin                   374\n",
      "BMI                              11\n",
      "Diabetes_pedigree_function        0\n",
      "Age                               0\n",
      "Target                            0\n",
      "dtype: int64\n"
     ]
    }
    ],
   "source": [
    "NaN_col_names = ['Plasma_glucose_concentration','blood_pressure','Triceps_skin_fold_thickness','serum_insulin','BMI']\n",
    "train[NaN_col_names] = train[NaN_col_names].replace(0, np.nan)\n",
    "print(train.isnull().sum())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### For features with many missing values, add a new feature indicating whether the value is missing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Triceps_skin_fold_thickness</th>\n",
       "      <th>Triceps_skin_fold_thickness_Missing</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>35.0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>29.0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>23.0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>35.0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>32.0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>45.0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Triceps_skin_fold_thickness  Triceps_skin_fold_thickness_Missing\n",
       "0                         35.0                                    0\n",
       "1                         29.0                                    0\n",
       "2                          NaN                                    1\n",
       "3                         23.0                                    0\n",
       "4                         35.0                                    0\n",
       "5                          NaN                                    1\n",
       "6                         32.0                                    0\n",
       "7                          NaN                                    1\n",
       "8                         45.0                                    0\n",
       "9                          NaN                                    1"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
    ],
   "source": [
    "# Many values are missing, so add a new column that flags whether each value is missing\n",
    "train['Triceps_skin_fold_thickness_Missing'] = train['Triceps_skin_fold_thickness'].apply(lambda x: 1 if pd.isnull(x) else 0)\n",
    "train[['Triceps_skin_fold_thickness','Triceps_skin_fold_thickness_Missing']].head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<matplotlib.axes._subplots.AxesSubplot at 0x2126fc81128>"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYUAAAEHCAYAAABBW1qbAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAaV0lEQVR4nO3de5AV5Z3/8fdHLoIRRW4uMCDECxuIy6AjGqNZb7UQsiom0YJfst4wmCqSmE1kY1IpRaJVMZsE45roYrxgYlDWREXXS8SorNkojogI+KPE6E9GUBCVlUVUxu/vj36mOcKZ4QDTcwbm86o6Naeffrr7e87A+czT3adbEYGZmRnAXtUuwMzM2g+HgpmZ5RwKZmaWcyiYmVnOoWBmZrnO1S5gV/Tp0yeGDBlS7TLMzHYrzzzzzJsR0bfcvN06FIYMGUJ9fX21yzAz261I+n/NzfPuIzMzyzkUzMws51AwM7Pcbn1MwcysWj788EMaGhrYtGlTtUtpVrdu3aipqaFLly4VL+NQMDPbCQ0NDfTo0YMhQ4YgqdrlbCMiWLduHQ0NDQwdOrTi5bz7yMxsJ2zatInevXu3y0AAkETv3r13eCTjUDAz20ntNRCa7Ex9DgUzM8v5mIKZWStYt24dJ598MgCvv/46nTp1om/f7EvDCxYsoGvXrq2+zYULF7JmzRrGjh3bauvs8KFw5NRbq11Cu/HMv55d7RLMdlu9e/dm0aJFAEybNo19992Xiy++uOLlGxsb6dSp0w5tc+HChSxZsqRVQ8G7j8zMCnbqqady5JFHMmLECH79618DsHnzZnr27MkPf/hDRo8ezYIFC5g7dy7Dhg3j+OOP55vf/Cbjx48HYMOGDZx77rmMHj2aUaNGce+99/Lee+8xffp0brvtNmpra7nzzjtbpdYOP1IwMyvarFmz6NWrFxs3bqSuro4vfelL9OjRg/Xr13PEEUdwxRVXsHHjRg477DD+/Oc/M3jwYM4666x8+enTpzN27FhuueUW3n77bY4++mgWL17MpZdeypIlS7j66qtbrdbCRgqSuklaIOk5SUslXZ7ab5H0sqRF6VGb2iXpGkkrJC2WdERRtZmZtaUZM2YwcuRIPvOZz9DQ0MBLL70EQNeuXTnjjDMAWLZsGcOGDeOggw5CEhMnTsyX/+Mf/8iVV15JbW0tJ554Ips2beLVV18tpNYiRwrvAydFxAZJXYAnJD2Q5k2NiK3HOp8HDk2Po4Hr0k8zs93WvHnzmD9/Pk8++STdu3fnuOOOy7870L179/y00Yhodh0Rwd13383BBx/8sfb58+e3er2FjRQisyFNdkmP5l81nA7cmpZ7EugpqX9R9ZmZtYX169fTq1cvunfvztKlS3n66afL9hsxYgTLly9n5cqVRAR33HFHPm/MmDFcc801+fSzzz4LQI8ePXj33Xdbtd5CDzRL6iRpEbAGeDginkqzrky7iGZI2ju1DQRWlizekNq2XudkSfWS6teuXVtk+WZmu+wLX/gCGzduZOTIkUyfPp2jjy6/A2Sfffbh2muv5ZRTTuH4449nwIAB7L///gBcdtllbNy4kcMPP5wRI0Ywbdo0AE466SSee+45Ro0atXscaI6IRqBWUk/gLkmfBr4PvA50BWYC3wOmA+W+erfNyCIiZqblqKura2nkYWZWFU0f2pBdlO6hhx4q2++dd9752PQpp5zC8uXLiQguvPBC6urqAPjEJz7BDTfcsM3yffv2bfUbjbXJKakR8Q7wGDA2IlanXUTvAzcDo1O3BmBQyWI1wKq2qM/MrD247rrrqK2tZfjw4bz33nt87Wtfa/MaChspSOoLfBgR70jqDpwCXCWpf0SsVnZ0ZTywJC0yF/iGpNvJDjCvj4jVRdVnZtbeTJ06lalTp1a1hiJ3H/UHZknqRDYimRMR90n6UwoMAYuAr6f+9wPjgBXARuC8AmszM7MyCguF
iFgMjCrTflIz/QOYUlQ9Zma2fb7MhZmZ5RwKZmaW87WPzMxaQWtfcbmSqxY/+OCDXHTRRTQ2NnLBBRdwySWX7PJ2PVIwM9sNNTY2MmXKFB544AGWLVvG7NmzWbZs2S6v16FgZrYbWrBgAYcccgif/OQn6dq1KxMmTOCee+7Z5fU6FMzMdkOvvfYagwZt+b5vTU0Nr7322i6v16FgZrYbKndV1aYrru4Kh4KZ2W6opqaGlSu3XEO0oaGBAQMG7PJ6HQpmZruho446ihdffJGXX36ZDz74gNtvv53TTjttl9frU1LNzFpBJaeQtqbOnTtz7bXXMmbMGBobGzn//PMZMWLErq+3FWozM7MqGDduHOPGjWvVdXr3kZmZ5RwKZmaWcyiYmVnOoWBmZjmHgpmZ5RwKZmaW8ympZmat4NXph7fq+gZf+vx2+5x//vncd9999OvXjyVLlmy3fyU8UjAz202de+65PPjgg626ToeCmdlu6nOf+xy9evVq1XUWFgqSuklaIOk5SUslXZ7ah0p6StKLku6Q1DW1752mV6T5Q4qqzczMyitypPA+cFJEjARqgbGSjgGuAmZExKHA28Ck1H8S8HZEHALMSP3MzKwNFRYKkdmQJrukRwAnAXem9lnA+PT89DRNmn+yWuPi4GZmVrFCjylI6iRpEbAGeBh4CXgnIjanLg3AwPR8ILASIM1fD/Qus87Jkuol1a9du7bI8s3MOpxCT0mNiEagVlJP4C7gU+W6pZ/lRgXb3FooImYCMwHq6uq2vfWQmVkVVHIKaWubOHEijz32GG+++SY1NTVcfvnlTJo0afsLtqBNvqcQEe9Iegw4BugpqXMaDdQAq1K3BmAQ0CCpM7A/8FZb1GdmtjuaPXt2q6+zyLOP+qYRApK6A6cALwCPAl9O3c4B7knP56Zp0vw/RbmbkJqZWWGKHCn0B2ZJ6kQWPnMi4j5Jy4DbJV0BPAvcmPrfCPxG0gqyEcKEAmszM7MyCguFiFgMjCrT/ldgdJn2TcCZRdVjZtbaIoL2fJLkzuxs8Teazcx2Qrdu3Vi3bt1OffC2hYhg3bp1dOvWbYeW8wXxzMx2Qk1NDQ0NDbTnU+O7detGTU3NDi3jUDAz2wldunRh6NCh1S6j1Xn3kZmZ5RwKZmaWcyiYmVnOoWBmZjmHgpmZ5RwKZmaWcyiYmVnOoWBmZjmHgpmZ5RwKZmaWcyiYmVnOoWBmZjmHgpmZ5RwKZmaWcyiYmVnOoWBmZrnCQkHSIEmPSnpB0lJJF6X2aZJek7QoPcaVLPN9SSskLZc0pqjazMysvCLvvLYZ+G5ELJTUA3hG0sNp3oyI+GlpZ0nDgQnACGAAME/SYRHRWGCNZmZWorCRQkSsjoiF6fm7wAvAwBYWOR24PSLej4iXgRXA6KLqMzOzbbXJMQVJQ4BRwFOp6RuSFku6SdIBqW0gsLJksQbKhIikyZLqJdW35xtmm5ntjgoPBUn7Ar8Hvh0R/wNcBxwM1AKrgZ81dS2zeGzTEDEzIuoioq5v374FVW1m1jEVGgqSupAFwm0R8QeAiHgjIhoj4iPgBrbsImoABpUsXgOsKrI+MzP7uCLPPhJwI/BCRPy8pL1/SbczgCXp+VxggqS9JQ0FDgUWFFWfmZltq8izjz4L/BPwvKRFqe0HwERJtWS7hl4BLgSIiKWS5gDLyM5cmuIzj8zM2lZhoRART1D+OMH9LSxzJXBlUTWZmVnL/I1mMzPLORTMzCznUDAzs5xDwczMcg4FMzPLORTMzCznUDAzs5xDwczMcg4FMzPLORTMzCznUDAzs5xDwczMcg4FMzPLORTMzCznUDAzs5xDwczMcg4FMzPLORTMzCxXUShIeqSSNjMz2721eI9mSd2AfYA+kg5gyz2X9wMGFFybmZm1se2NFC4EngH+Nv1setwD/LKlBSUNkvSopBckLZV0UWrvJelhSS+mnwekdkm6RtIKSYslHbGrL87MzHZMi6EQEb+IiKHAxRHx
yYgYmh4jI+La7ax7M/DdiPgUcAwwRdJw4BLgkYg4FHgkTQN8Hjg0PSYD1+38yzIzs53R4u6jJhHxb5KOBYaULhMRt7awzGpgdXr+rqQXgIHA6cAJqdss4DHge6n91ogI4ElJPSX1T+sxM7M2UFEoSPoNcDCwCGhMzQE0GwpbLT8EGAU8BRzY9EEfEasl9UvdBgIrSxZrSG0fCwVJk8lGEgwePLiSzZuZWYUqCgWgDhie/orfIZL2BX4PfDsi/kdSs13LtG2zvYiYCcwEqKur2+F6zMyseZV+T2EJ8Dc7unJJXcgC4baI+ENqfkNS/zS/P7AmtTcAg0oWrwFW7eg2zcxs51UaCn2AZZIekjS36dHSAsqGBDcCL0TEz0tmzQXOSc/PITuTqan97HQW0jHAeh9PMDNrW5XuPpq2E+v+LPBPwPOSFqW2HwA/BuZImgS8CpyZ5t0PjANWABuB83Zim2ZmtgsqPfvo8R1dcUQ8QfnjBAAnl+kfwJQd3Y6ZmbWeSs8+epctB327Al2A/42I/YoqzMzM2l6lI4UepdOSxgOjC6nIzMyqZqeukhoRdwMntXItZmZWZZXuPvpiyeReZN9b8HcEzMz2MJWefXRqyfPNwCtkl6UwM7M9SKXHFHx6qJlZB1Dp7qMa4N/IvnsQwBPARRHRUGBt1sZenX54tUtoNwZf+ny1SzCrikoPNN9M9o3jAWQXqbs3tZmZ2R6k0lDoGxE3R8Tm9LgF6FtgXWZmVgWVhsKbkr4qqVN6fBVYV2RhZmbW9ioNhfOBs4DXye5v8GV8bSIzsz1Opaek/gg4JyLehuw+y8BPycLCzMz2EJWOFP6uKRAAIuItsjupmZnZHqTSUNhL0gFNE2mkUOkow8zMdhOVfrD/DPhvSXeSfU/hLODKwqoyM7OqqPQbzbdKqie7CJ6AL0bEskIrMzOzNlfxLqAUAg4CM7M92E5dOtvMzPZMDgUzM8s5FMzMLFdYKEi6SdIaSUtK2qZJek3SovQYVzLv+5JWSFouaUxRdZmZWfOKHCncAowt0z4jImrT434AScOBCcCItMyvJHUqsDYzMyujsFCIiPnAWxV2Px24PSLej4iXgRXA6KJqMzOz8qpxTOEbkhan3UtN35IeCKws6dOQ2rYhabKkekn1a9euLbpWM7MOpa1D4TrgYKCW7GqrP0vtKtM3yq0gImZGRF1E1PXt61s6mJm1pjYNhYh4IyIaI+Ij4Aa27CJqAAaVdK0BVrVlbWZm1sahIKl/yeQZQNOZSXOBCZL2ljQUOBRY0Ja1mZlZgVc6lTQbOAHoI6kBuAw4QVIt2a6hV4ALASJiqaQ5ZJfR2AxMiYjGomozM7PyCguFiJhYpvnGFvpfia+8amZWVb4nglk7deTUW6tdQrvxzL+eXe0SOgxf5sLMzHIOBTMzyzkUzMws51AwM7OcQ8HMzHIOBTMzyzkUzMws51AwM7OcQ8HMzHIOBTMzyzkUzMws51AwM7OcQ8HMzHIOBTMzyzkUzMws51AwM7OcQ8HMzHIOBTMzyxUWCpJukrRG0pKStl6SHpb0Yvp5QGqXpGskrZC0WNIRRdVlZmbNK3KkcAswdqu2S4BHIuJQ4JE0DfB54ND0mAxcV2BdZmbWjMJCISLmA29t1Xw6MCs9nwWML2m/NTJPAj0l9S+qNjMzK6+tjykcGBGrAdLPfql9ILCypF9DajMzszbUXg40q0xblO0oTZZUL6l+7dq1BZdlZtaxtHUovNG0Wyj9XJPaG4BBJf1qgFXlVhARMyOiLiLq+vbtW2ixZmYdTVuHwlzgnPT8HOCekvaz01lIxwDrm3YzmZlZ2+lc1IolzQZOAPpIagAuA34MzJE0CXgVODN1vx8YB6wANgLnFVWXmZk1r7BQiIiJzcw6uUzfAKYUVYuZmVWmvRxoNjOzdqCwkYKZWWt5dfrh1S6h3Rh86fOFrt8jBTMzyzkUzMws51AwM7OcQ8HMzHIOBTMzyzkUzMws51AwM7Oc
Q8HMzHIOBTMzyzkUzMws51AwM7OcQ8HMzHIOBTMzyzkUzMws51AwM7OcQ8HMzHIOBTMzyzkUzMwsV5XbcUp6BXgXaAQ2R0SdpF7AHcAQ4BXgrIh4uxr1mZl1VNUcKZwYEbURUZemLwEeiYhDgUfStJmZtaH2tPvodGBWej4LGF/FWszMOqRqhUIAf5T0jKTJqe3AiFgNkH72K7egpMmS6iXVr127to3KNTPrGKpyTAH4bESsktQPeFjS/610wYiYCcwEqKuri6IKNDPriKoyUoiIVennGuAuYDTwhqT+AOnnmmrUZmbWkbV5KEj6hKQeTc+BfwCWAHOBc1K3c4B72ro2M7OOrhq7jw4E7pLUtP3fRcSDkp4G5kiaBLwKnFmF2szMOrQ2D4WI+Cswskz7OuDktq7HzMy2aE+npJqZWZU5FMzMLOdQMDOznEPBzMxyDgUzM8s5FMzMLOdQMDOznEPBzMxyDgUzM8s5FMzMLOdQMDOznEPBzMxyDgUzM8s5FMzMLOdQMDOznEPBzMxyDgUzM8s5FMzMLOdQMDOznEPBzMxy7S4UJI2VtFzSCkmXVLseM7OOpF2FgqROwC+BzwPDgYmShle3KjOzjqNdhQIwGlgREX+NiA+A24HTq1yTmVmH0bnaBWxlILCyZLoBOLq0g6TJwOQ0uUHS8jaqbY93EPQB3qx2He3CZap2BVbC/zZLtM6/zYOam9HeQqHcq42PTUTMBGa2TTkdi6T6iKirdh1mW/O/zbbT3nYfNQCDSqZrgFVVqsXMrMNpb6HwNHCopKGSugITgLlVrsnMrMNoV7uPImKzpG8ADwGdgJsiYmmVy+pIvFvO2iv/22wjiojt9zIzsw6hve0+MjOzKnIomJlZzqFgvrSItVuSbpK0RtKSatfSUTgUOjhfWsTauVuAsdUuoiNxKJgvLWLtVkTMB96qdh0diUPByl1aZGCVajGzKnMo2HYvLWJmHYdDwXxpETPLORTMlxYxs5xDoYOLiM1A06VFXgDm+NIi1l5Img38BRgmqUHSpGrXtKfzZS7MzCznkYKZmeUcCmZmlnMomJlZzqFgZmY5h4KZmeUcCmZmlnModDCSektalB6vS3qtZLrrVn0fktSjWrVuTdITkmrLtO9UnZKGS3pO0rOShjTTp7Okd5qZ91tJ41tY/3ckdatgPVMkfaWF9Zwi6e6WXktbkHSBpJD09yVtZ6a28Wn6ZknDdnC9Z0ia2tr12s5pV/dotuJFxDqgFkDSNGBDRPy0tI8kkX2HZUzbV7jjdqHOLwJ3RsSPWrOeEt8BbgI2tdQpIn5Z0PaL8DwwEXg8TU8AnmuaGRHn7egKI+Ku1inNWoNHCgaApEMkLZF0PbAQ6J++QdozzT9P0uL0l/XNqe1ASX+QVC9pgaRjUvsVkmZJelTSi5LOT+0D01/7i9K2jm2mls6SfiPp+dTvW1vN75T+Sp+Wphsk9Sx5DTdKWirpgaa/1Mts4zSyb3J/XdK81PYvafklkr5ZZpm9JP1K0jJJ9wJ9Wng//xnoB/xX0/pT+4/Te/gXSf1K3q9vp+eHSfpT6rNw6xGMpKOb2tNyN0p6XNJfJU0p6XdO+p0sSjXv1dz7Kumf02t6TtJvm3tNyWPAsWld+wGDgfwGOE2juR3ZVhqBXJ2e/1bSLyT9d3pNZ6T2TpKuT7/XeyU9qBZGabbzPFKwUsOB8yLi6wDZgAEkjQS+BxwbEW9J6pX6XwP8JCKeTB9e9wGfTvMOB44F9gMWSvpP4KvAvRFxlbKb+3Rvpo4jgT4RcXjafs+SeZ2B3wELI+KqMssOAyZGxPOS/gCMJ7tHxMdExFxJo4E3I+Lq9PwrZPeX6AQskPQ4sKxksS8DQ9NrHJDmXV/uBUTEDEnfBY6PiHckdQb2Bx6PiEsk/Rw4H/jxVovOBqZFxL0p0PYCDknvw/HADOC0iGhIv5/DgJOBnsALykL9U8AZZL+vzZJmkv1F/1Iz7+u/AAdFxAdbvdflfEQW
DKcABwJ3p+1trbnfYSXb6gd8luzf0BzgLuBMsku6Hw78DdklWcq+97ZrPFKwUi9FxNNl2k8C7oiItwCafpJ9MFwvaRHZh8MBkpo+6O+OiE0RsQaYDxxFdvG9CyRdBnw6IjY0U8cKsmvd/ELSGGB9ybwbaT4QILth0PPp+TPAkO285ibHA7+PiI0R8W56Pcdt1edzwOyI+CgiGsg+HHfEexHxQHO1STqA7IP0XoD0/m1Msz8N/Ar4x7TtJvdFxAfpfX4L6Ev2ezkKqE+/m78HDqb593Up8FtlxzU+rOB13E4WMhMoE7jJrmzr7sgsZsu9PY4juy7XRxGxii27r6yVORSs1P820y7K32NBwOiIqE2PgRHxXpq3df+IiD8BJwCrgdvUzMHVdNzj74AngG8B/14y+8/AyZL2bqbW90ueN1L5aLjcfSXKlldhv3I+KHneXG3NrX9VWn7rA+3lXq+Am0p+L8Mi4kctvK9jyP7qHk0WJJ228zr+AhwB7BcRL5XrsIvbKn1N2uqnFcyhYJWYB0xo2m1UsvtoHlC6H7v0A2u8pL0l9SH7K7xe0kHA6xExk+zeu6PKbUxSX7ID3f8BXEb2AdRkZtru7WmXTGuZD5whqbukfcluSfpfZfpMSPvnB5L9Bd6Sd4GKz4qKiLeBNyWdCiCpm6R90uy3gH8EfpJ2I7VkHnBWeu+bzjgbXO59TR/KNSmwp5KNNPZpbsWpzgC+D/yguT6tta0STwBfVqY/2ajNCuBjCrZdEbFY0k+A+ZI2k+36mEQWCNdJOo/s39KjbAmJp4EHyG7gc1lEvKHsgPN3JH0IbCA7xlDOIOBGZTvNg+x4Rmk9P5F0JXCLpLNb6TUuUHaZ5qbdZ9el4xKl/0fuBE4kO7C6nCwkWjITmCdpJZXffP4rwL+n1/cB8KWSGlcrO0B+f0uvO9V9edr2XmS7ab5ONpLY+n3tDPxO2Sm9ewFXpd1nLYqI/9xOl3K/w7Lbajp2tR1zyHZjNr33T/Hx3YrWSnzpbGt1kq4gHcCtdi2255C0b0RsSKOQp4CjI2Jtteva03ikYGa7iwfSabBdyEafDoQCeKRgVSWpnm3/OPk/EbGsXP+d3Mb1wDFbNf88Im5tpfXPJTtfv9TFETGvXP/2TtIFZN/hKDU/Ir5Vrr/tWRwKZmaW89lHZmaWcyiYmVnOoWBmZjmHgpmZ5f4/XwfQEdy55YMAAAAASUVORK5CYII=\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "#color = sns.color_palette()\n",
    "\n",
    "%matplotlib inline\n",
    "sns.countplot(x=\"Triceps_skin_fold_thickness_Missing\", hue=\"Target\",data=train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Whether the triceps skin fold thickness is missing appears to have little relation to the label."
   ]
  },
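  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The observation above is read off the count plot. As a rough numeric check, the class proportions within each value of the missing indicator can be tabulated with `pd.crosstab`. This is only a sketch: it uses a small made-up frame named `toy` (a hypothetical stand-in for `train`), so the numbers it prints are not results from the real data.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Toy stand-in for the notebook's `train` frame; the values are made up for illustration.\n",
    "toy = pd.DataFrame({\n",
    "    'Triceps_skin_fold_thickness_Missing': [0, 0, 0, 0, 1, 1, 1, 1],\n",
    "    'Target': [0, 0, 1, 1, 0, 0, 1, 1],\n",
    "})\n",
    "\n",
    "# Row-normalised contingency table: each row holds the class proportions\n",
    "# within one value of the missing indicator.\n",
    "rates = pd.crosstab(\n",
    "    toy['Triceps_skin_fold_thickness_Missing'],\n",
    "    toy['Target'],\n",
    "    normalize='index',\n",
    ")\n",
    "print(rates)\n",
    "```\n",
    "\n",
    "Running the same call on `train` would give the share of each class among rows with and without a missing value; similar rows would suggest that the indicator carries little information about `Target`."
   ]
  },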
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<matplotlib.axes._subplots.AxesSubplot at 0x212703cbc50>"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYUAAAEHCAYAAABBW1qbAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAVP0lEQVR4nO3df5DU9Z3n8ec7/DhQORUcXXFQ0KB3EMOgIyaXmE3UWwm5iCbRlbokGk1w90zKbK3WurmUEjbc5WqTmHXd9UqNK976c/NDiWehhoqhkl2DQAgCLpFED0ZRkBiii6hM3vdHf+drCz3QM0xPzzDPR1VX9/fT38/n++6uqXn190d/OjITSZIA3tHsAiRJA4ehIEkqGQqSpJKhIEkqGQqSpNLwZhewP4444oicOHFis8uQpEFlxYoVL2VmS63nBnUoTJw4keXLlze7DEkaVCLi/3X3nIePJEklQ0GSVDIUJEmlQX1OQZKa5c0336Sjo4OdO3c2u5RujRo1itbWVkaMGFF3H0NBknqho6ODMWPGMHHiRCKi2eXsITPZtm0bHR0dTJo0qe5+Hj6SpF7YuXMn48aNG5CBABARjBs3rsd7MoaCJPXSQA2ELr2pz1CQJJU8pyBJfWDbtm2cddZZALzwwgsMGzaMlpbKl4aXLVvGyJEj+3ybK1euZMuWLcycObPPxhzyoXDq1Xc0u4QBY8Vff7rZJUiD1rhx41i1ahUA8+bN45BDDuGqq66qu39nZyfDhg3r0TZXrlzJmjVr+jQUPHwkSQ320Y9+lFNPPZWpU6dy6623ArBr1y4OO+wwvvzlLzNjxgyWLVvGokWLOOmkkzjjjDP4whe+wHnnnQfAq6++yiWXXMKMGTOYPn06P/jBD3jttdeYP38+d955J21tbXznO9/pk1qH/J6CJDXawoULGTt2LDt27KC9vZ2Pf/zjjBkzhu3bt3PKKafw1a9+lR07dnDiiSfy05/+lGOPPZYLL7yw7D9//nxmzpzJ7bffzssvv8zpp5/O6tWrufbaa1mzZg3f+ta3+qxW9xQkqcGuv/56pk2bxnvf+146Ojr41a9+BcDIkSM5//zzAVi3bh0nnXQSxx13HBHBnDlzyv6PPPIICxYsoK2tjQ996EPs3LmTjRs3NqRW9xQkqYF++MMfsnTpUh5//HFGjx7N+9///vK7A6NHjy4vG83MbsfITO6//35OOOGEt7UvXbq0z+t1T0GSGmj79u2MHTuW0aNHs3btWp544oma602dOpX169ezadMmMpN77723fO6cc87hhhtuKJd//vOfAzBmzBheeeWVPq3XUJCkBvrIRz7Cjh07mDZtGvPnz+f000+vud5BBx3EjTfeyNlnn80ZZ5zB+PHjOfTQQwG47rrr2LFjByeffDJTp05l3rx5AJx55pn84he/YPr06Z5olqSBquufNlQmpXv44Ydrrvfb3/72bctnn30269evJzO5/PLLaW9vB+Dggw/mlltu2aN/S0tLn//QWMP2FCJiQkT8KCKeioi1EXFl0T4vIp6LiFXFbVZVn7+MiA0RsT4izmlUbZI0EN100020tbUxZcoUXnvtNT73uc/1ew2N3FPYBfx5Zq6MiDHAioh4tHju+sz8evXKETEFuAiYCowHfhgRJ2ZmZwNrlKQB4+qrr+bqq69uag0N21PIzM2ZubJ4/ArwFHDMXrrMBu7JzNcz8xlgAzCjUfVJkvbULyeaI2IiMB34WdH0+YhYHRG3RcThRdsxwKaqbh3UCJGImBsRyyNi+datWxtYtSQNPQ0PhYg4BPgu8MXM/B1wE3AC0AZsBr7RtWqN7ntcuJuZN2dme2a2d002JUnqGw0NhYgYQSUQ7szM7wFk5ouZ2ZmZvwdu4a1DRB3AhKrurcDzjaxPkvR2DTvRHJWv6X0beCozv1nVfnRmbi4WzwfWFI8XAXdFxDepnGieDCxrVH2S1Jf6
esblemYtXrx4MVdeeSWdnZ189rOf5Zprrtnv7Tby6qP3AZ8CnoyIVUXbl4A5EdFG5dDQs8DlAJm5NiLuA9ZRuXLpCq88kqTaOjs7ueKKK3j00UdpbW3ltNNO49xzz2XKlCn7NW7DQiEzf0Lt8wQP7aXPAmBBo2qSpAPFsmXLeOc738nxxx8PwEUXXcQDDzyw36HgNBeSNAg999xzTJjw1mnY1tZWnnvuuf0e11CQpEGo1qyqXTOu7g9DQZIGodbWVjZteuurXR0dHYwfP36/xzUUJGkQOu2003j66ad55plneOONN7jnnns499xz93tcZ0mVpD5QzyWkfWn48OHceOONnHPOOXR2dnLppZcyderU/R+3D2qT1AB9fd37YNbf/3AHi1mzZjFr1qx9r9gDHj6SJJUMBUlSyVCQJJUMBUlSyVCQJJUMBUlSyUtSJakPbJx/cp+Od+y1T+5znUsvvZQHH3yQI488kjVr1uxz/Xq4pyBJg9Qll1zC4sWL+3RMQ0GSBqkPfOADjB07tk/HNBQkSSVDQZJUMhQkSSVDQZJU8pJUSeoD9VxC2tfmzJnDY489xksvvURraytf+cpXuOyyy/ZrTENBkgapu+++u8/H9PCRJKlkKEiSSoaCJPVSZja7hL3qTX2GgiT1wqhRo9i2bduADYbMZNu2bYwaNapH/TzRLEm90NraSkdHB1u3bm12Kd0aNWoUra2tPepjKEhSL4wYMYJJkyY1u4w+5+EjSVLJUJAklQwFSVKpYaEQERMi4kcR8VRErI2IK4v2sRHxaEQ8XdwfXrRHRNwQERsiYnVEnNKo2iRJtTVyT2EX8OeZ+R+B9wBXRMQU4BpgSWZOBpYUywAfBiYXt7nATQ2sTZJUQ8NCITM3Z+bK4vErwFPAMcBsYGGx2kLgvOLxbOCOrHgcOCwijm5UfZKkPfXLOYWImAhMB34GHJWZm6ESHMCRxWrHAJuqunUUbbuPNTcilkfE8oF8fbAkDUYND4WIOAT4LvDFzPzd3lat0bbHVwUz8+bMbM/M9paWlr4qU5JEg0MhIkZQCYQ7M/N7RfOLXYeFivstRXsHMKGqeyvwfCPrkyS9XSOvPgrg28BTmfnNqqcWARcXjy8GHqhq/3RxFdJ7gO1dh5kkSf2jkdNcvA/4FPBkRKwq2r4EfA24LyIuAzYCFxTPPQTMAjYAO4DPNLA2SVINDQuFzPwJtc8TAJxVY/0ErmhUPZKkffMbzZKkkqEgSSoZCpKkkqEgSSoZCpKkkqEgSSoZCpKkkqEgSSoZCpKkkqEgSSoZCpKkkqEgSSoZCpKkkqEgSSoZCpKkkqEgSSoZCpKkkqEgSSoZCpKkkqEgSSoZCpKkkqEgSSoZCpKkkqEgSSoZCpKkkqEgSSoZCpKkkqEgSSoZCpKkkqEgSSo1LBQi4raI2BIRa6ra5kXEcxGxqrjNqnruLyNiQ0Ssj4hzGlWXJKl7dYVCRCypp203twMza7Rfn5ltxe2hYqwpwEXA1KLP30fEsHpqkyT1nb2GQkSMioixwBERcXhEjC1uE4Hxe+ubmUuB39RZx2zgnsx8PTOfATYAM+rsK0nqI/vaU7gcWAH8h+K+6/YA8He93ObnI2J1cXjp8KLtGGBT1TodRdseImJuRCyPiOVbt27tZQmSpFr2GgqZ+TeZOQm4KjOPz8xJxW1aZt7Yi+3dBJwAtAGbgW8U7VFr893UdHNmtmdme0tLSy9KkCR1Z3g9K2Xm30bEfwImVvfJzDt6srHMfLHrcUTcAjxYLHYAE6pWbQWe78nYkqT9V1coRMT/ofIJfxXQWTQn0KNQiIijM3NzsXg+0HVl0iLgroj4JpVzFZOBZT0ZW5K0/+oKBaAdmJKZNQ/p1BIRdwMfpHKSugO4DvhgRLRRCZRnqZyzIDPXRsR9wDpgF3BFZnbWGleS1Dj1hsIa4A+onAeoS2bOqdH87b2svwBYUO/4kqS+V28oHAGsi4hlwOtdjZl5bkOqkiQ1Rb2h
MK+RRUiSBoZ6rz76caMLkSQ1X71XH73CW98bGAmMAP4tM/99owqTJPW/evcUxlQvR8R5OA2FpH6ycf7JzS5hwDj22icbOn6vZknNzPuBM/u4FklSk9V7+OhjVYvvoPK9hbq/syBJGhzqvfroo1WPd1H54tnsPq9GktRU9Z5T+EyjC5EkNV+9P7LTGhHfL35J7cWI+G5EtDa6OElS/6r38NE/AHcBFxTLnyza/nMjilJzeIXHWxp9hYc0UNV79VFLZv5DZu4qbrcD/piBJB1g6g2FlyLikxExrLh9EtjWyMIkSf2v3lC4FLgQeIHKTKmfADz5LEkHmHrPKfwVcHFmvgwQEWOBr1MJC0nSAaLePYV3dwUCQGb+BpjemJIkSc1Sbyi8IyIO71oo9hTq3cuQJA0S9f5j/wbwzxHxHSrTW1yIv5ImSQecer/RfEdELKcyCV4AH8vMdQ2tTJLU7+o+BFSEgEEgSQewXk2dLUk6MBkKkqSSoSBJKhkKkqSSoSBJKhkKkqSSoSBJKhkKkqSSoSBJKhkKkqRSw0IhIm6LiC0RsaaqbWxEPBoRTxf3hxftERE3RMSGiFgdEac0qi5JUvcauadwOzBzt7ZrgCWZORlYUiwDfBiYXNzmAjc1sC5JUjcaFgqZuRT4zW7Ns4GFxeOFwHlV7XdkxePAYRFxdKNqkyTV1t/nFI7KzM0Axf2RRfsxwKaq9TqKtj1ExNyIWB4Ry7du3drQYiVpqBkoJ5qjRlvWWjEzb87M9sxsb2lpaXBZkjS09HcovNh1WKi431K0dwATqtZrBZ7v59okacjr71BYBFxcPL4YeKCq/dPFVUjvAbZ3HWaSJPWfun95raci4m7gg8AREdEBXAd8DbgvIi4DNgIXFKs/BMwCNgA7gM80qi5JUvcaFgqZOaebp86qsW4CVzSqFklSfQbKiWZJ0gBgKEiSSoaCJKlkKEiSSoaCJKlkKEiSSoaCJKlkKEiSSoaCJKlkKEiSSoaCJKlkKEiSSoaCJKlkKEiSSoaCJKlkKEiSSoaCJKlkKEiSSoaCJKlkKEiSSoaCJKlkKEiSSoaCJKlkKEiSSoaCJKlkKEiSSoaCJKlkKEiSSoaCJKlkKEiSSsObsdGIeBZ4BegEdmVme0SMBe4FJgLPAhdm5svNqE+Shqpm7il8KDPbMrO9WL4GWJKZk4ElxbIkqR8NpMNHs4GFxeOFwHlNrEWShqRmhUICj0TEioiYW7QdlZmbAYr7I5tUmyQNWU05pwC8LzOfj4gjgUcj4l/r7ViEyFyAY489tlH1SdKQ1JQ9hcx8vrjfAnwfmAG8GBFHAxT3W7rpe3Nmtmdme0tLS3+VLElDQr+HQkQcHBFjuh4DfwSsARYBFxerXQw80N+1SdJQ14zDR0cB34+Iru3flZmLI+IJ4L6IuAzYCFzQhNokaUjr91DIzF8D02q0bwPO6u96JElvGUiXpEqSmsxQkCSVDAVJUslQkCSVDAVJUslQkCSVDAVJUslQkCSVDAVJUslQkCSVDAVJUslQkCSVDAVJUslQkCSVDAVJUslQkCSVDAVJUslQkCSVDAVJUslQkCSVDAVJUslQkCSVDAVJUslQkCSVDAVJUslQkCSVDAVJUslQkCSVDAVJUslQkCSVDAVJUmnAhUJEzIyI9RGxISKuaXY9kjSUDKhQiIhhwN8BHwamAHMiYkpzq5KkoWNAhQIwA9iQmb/OzDeAe4DZTa5JkoaM4c0uYDfHAJuqljuA06tXiIi5wNxi8dWIWN9PtR3wjoMjgJeaXceAcF00uwJV8W+zSt/8bR7X3RMDLRRqvdp820LmzcDN/VPO0BIRyzOzvdl1SLvzb7P/DLTDRx3AhKrlVuD5JtUiSUPOQAuFJ4DJETEpIkYCFwGLmlyTJA0ZA+rwUWbuiojPAw8Dw4DbMnNtk8saSjwsp4HKv81+Epm577UkSUPCQDt8JElqIkNBklQyFOTUIhqwIuK2iNgSEWuaXctQYSgMcU4t
ogHudmBms4sYSgwFObWIBqzMXAr8ptl1DCWGgmpNLXJMk2qR1GSGgvY5tYikocNQkFOLSCoZCnJqEUklQ2GIy8xdQNfUIk8B9zm1iAaKiLgb+BfgpIjoiIjLml3Tgc5pLiRJJfcUJEklQ0GSVDIUJEklQ0GSVDIUJEklQ0GSVDIUpEJE/ElEfLqPx7w9Ij5RPL61NzPQRsS8iMiIeGdV258Vbe3F8kMRcVgPx+3z16vBb0D9RrNUj4gYXnzprk9l5v/u6zF3G/+z+9H9SSrfNv9qsfwJYF3V2LN6UU9DX68GJ/cU1DQRcXBE/N+I+EVErImIP46IUyPixxGxIiIejoiji3Ufi4j/ERE/Bq6s/gRePP9qcf/Bov99EfHLiPhaRPzXiFgWEU9GxAl7qWdeRFxVtb3/VfT7ZUScUbRPLdpWRcTqiJgcEROrfwQmIq6KiHk1xn+s6pP9qxGxoHjtj0fEUft4u+6nmNI8Io4HtgNbq8Z+NiKOqPWeFs9/LSLWFTV/vQev96DivVwdEfdGxM+6XoMOTIaCmmkm8HxmTsvMdwGLgb8FPpGZpwK3AQuq1j8sM/8wM7+xj3GnAVcCJwOfAk7MzBnArcAXelDf8KLfF4HrirY/Af4mM9uAdioTCvbGwcDjmTkNWAp8bh/r/w7YFBHvAuYA93az3h7vaUSMBc4Hpmbmu3lrb2N3tV7vfwNeLvr9FXBqfS9Pg5WhoGZ6Eji7+IR6BpXZWt8FPBoRq4AvU5m1tUt3/wh390Rmbs7M14FfAY9UbW9iD+r7XnG/oqrfvwBfioi/AI7LzNd6MF61N4AHa4y/N/dQOYR0HvD9btZ523uamdupBMpO4NaI+Biwo5u+tV7v+4vtkplrgNV11KlBzFBQ02TmL6l88nwS+J/Ax4G1mdlW3E7OzD+q6vJvVY93Ufz9RkQAI6uee73q8e+rln9Pz86jdfXr7OqXmXcB5wKvAQ9HxJnVtRRG1TH2m/nWxGPl+PvwAyp7Phsz83e1Vtj9PY2Ia4vzLzOA71IJlMXdjL/H66X2723oAGYoqGkiYjywIzP/Efg6cDrQEhHvLZ4fERFTu+n+LG8dypgNjGhwuRQ1HQ/8OjNvoDLF+LuBF4EjI2JcRPw74L80YtvFXslf8PZDarvXt/t7ekpEHAIcmpkPUTk01NaDzf4EuLAYewqVQ3I6gHn1kZrpZOCvI+L3wJvAn1L51H1DRBxK5e/zW0CtqbxvAR6IiGXAEt6+F9FIfwx8MiLeBF4A5mfmmxExH/gZ8Azwr43aeGbes49Var2nY6i8V6OofPL/sx5s8u+BhRGxGvg5lcNH23tcuAYNp86W1K2IGAaMyMydxZVbS6icuH+jyaWpQdxTkLQ3BwE/iogRVPYy/tRAOLC5p6AhJyL+O3DBbs3/lJndHqvvDwO1Lg0thoIkqeTVR5KkkqEgSSoZCpKkkqEgSSr9fzjYz6Z8BgNnAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "#缺失值比较多，干脆就开一个新的字段，表明是缺失值还是不是缺失值\n",
    "train['serum_insulin_Missing'] = train['serum_insulin'].apply(lambda x: 1 if pd.isnull(x) else 0)\n",
    "sns.countplot(x=\"serum_insulin_Missing\", hue=\"Target\",data=train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "不过特征是否缺失好像和目标也没什么关系"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "train.drop([\"Triceps_skin_fold_thickness_Missing\", \"serum_insulin_Missing\"], axis=1, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "感觉特征缺失是随机的，将这新增的特征删除，老实用中值填补算了。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "pregnants                       0\n",
      "Plasma_glucose_concentration    0\n",
      "blood_pressure                  0\n",
      "Triceps_skin_fold_thickness     0\n",
      "serum_insulin                   0\n",
      "BMI                             0\n",
      "Diabetes_pedigree_function      0\n",
      "Age                             0\n",
      "Target                          0\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "medians = train.median() \n",
    "train = train.fillna(medians)\n",
    "\n",
    "print(train.isnull().sum())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 数据标准化"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "#  get labels\n",
    "y_train = train['Target']   \n",
    "X_train = train.drop([\"Target\"], axis=1)\n",
    "\n",
    "#用于保存特征工程之后的结果\n",
    "feat_names = X_train.columns\n",
    "\n",
    "# 数据标准化\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "# 初始化特征的标准化器\n",
    "ss_X = StandardScaler()\n",
    "\n",
    "# 分别对训练和测试数据的特征进行标准化处理\n",
    "X_train = ss_X.fit_transform(X_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 特征处理结果存为文件"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "#存为csv格式\n",
    "X_train = pd.DataFrame(columns = feat_names, data = X_train)\n",
    "\n",
    "train = pd.concat([X_train, y_train], axis = 1)\n",
    "\n",
    "train.to_csv('FE_pima-indians-diabetes.csv',index = False,header=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pregnants</th>\n",
       "      <th>Plasma_glucose_concentration</th>\n",
       "      <th>blood_pressure</th>\n",
       "      <th>Triceps_skin_fold_thickness</th>\n",
       "      <th>serum_insulin</th>\n",
       "      <th>BMI</th>\n",
       "      <th>Diabetes_pedigree_function</th>\n",
       "      <th>Age</th>\n",
       "      <th>Target</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.639947</td>\n",
       "      <td>0.866045</td>\n",
       "      <td>-0.031990</td>\n",
       "      <td>0.670643</td>\n",
       "      <td>-0.181541</td>\n",
       "      <td>0.166619</td>\n",
       "      <td>0.468492</td>\n",
       "      <td>1.425995</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-0.844885</td>\n",
       "      <td>-1.205066</td>\n",
       "      <td>-0.528319</td>\n",
       "      <td>-0.012301</td>\n",
       "      <td>-0.181541</td>\n",
       "      <td>-0.852200</td>\n",
       "      <td>-0.365061</td>\n",
       "      <td>-0.190672</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1.233880</td>\n",
       "      <td>2.016662</td>\n",
       "      <td>-0.693761</td>\n",
       "      <td>-0.012301</td>\n",
       "      <td>-0.181541</td>\n",
       "      <td>-1.332500</td>\n",
       "      <td>0.604397</td>\n",
       "      <td>-0.105584</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>-0.844885</td>\n",
       "      <td>-1.073567</td>\n",
       "      <td>-0.528319</td>\n",
       "      <td>-0.695245</td>\n",
       "      <td>-0.540642</td>\n",
       "      <td>-0.633881</td>\n",
       "      <td>-0.920763</td>\n",
       "      <td>-1.041549</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>-1.141852</td>\n",
       "      <td>0.504422</td>\n",
       "      <td>-2.679076</td>\n",
       "      <td>0.670643</td>\n",
       "      <td>0.316566</td>\n",
       "      <td>1.549303</td>\n",
       "      <td>5.484909</td>\n",
       "      <td>-0.020496</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   pregnants  Plasma_glucose_concentration  blood_pressure  \\\n",
       "0   0.639947                      0.866045       -0.031990   \n",
       "1  -0.844885                     -1.205066       -0.528319   \n",
       "2   1.233880                      2.016662       -0.693761   \n",
       "3  -0.844885                     -1.073567       -0.528319   \n",
       "4  -1.141852                      0.504422       -2.679076   \n",
       "\n",
       "   Triceps_skin_fold_thickness  serum_insulin       BMI  \\\n",
       "0                     0.670643      -0.181541  0.166619   \n",
       "1                    -0.012301      -0.181541 -0.852200   \n",
       "2                    -0.012301      -0.181541 -1.332500   \n",
       "3                    -0.695245      -0.540642 -0.633881   \n",
       "4                     0.670643       0.316566  1.549303   \n",
       "\n",
       "   Diabetes_pedigree_function       Age  Target  \n",
       "0                    0.468492  1.425995       1  \n",
       "1                   -0.365061 -0.190672       0  \n",
       "2                    0.604397 -0.105584       1  \n",
       "3                   -0.920763 -1.041549       0  \n",
       "4                    5.484909 -0.020496       1  "
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train.head()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
